Abstract

This paper is mainly concerned with the 
question of how to decompose multiclass 
classification problems into binary subprob- 
lems. We extend known Jensen-Shannon 
bounds on the Bayes risk of binary problems 
to hierarchical multiclass problems and use 
these bounds to develop a heuristic procedure
for constructing hierarchical multiclass
decompositions for multinomials. We test our
method and compare it to the well known 
"all-pairs" decomposition. Our tests are per- 
formed using a new authorship determina- 
tion benchmark test of machine learning au- 
thors. The new method consistently outper- 
forms the all-pairs decomposition when the 
number of classes is small and breaks even on 
larger multiclass problems. Using both meth- 
ods, the classification accuracy we achieve, 
using an SVM over a feature set consisting of 
both high frequency single tokens and high 
frequency token-pairs, appears to be excep- 
tionally high compared to known results in 
authorship determination. 



1. Introduction 

In this paper we consider the problem of decompos- 
ing multiclass classification problems into binary ones. 
While binary classification is quite well explored, the
question of multiclass classification is still rather open
and has recently attracted considerable attention from both
machine learning theorists and practitioners. A number
of general decomposition schemes have emerged,
including 'error-correcting output coding' (?; ?), the
more general 'probabilistic embedding' (?) and
'constraint classification' (?). Nevertheless, practitioners
are still mainly using the infamous 'one-vs-rest' de- 
composition whereby an individual binary "soft" (or 
confidence-rated) classifier is trained to distinguish be- 
tween each class and the union of the other classes and 
then, for classifying an unseen instance, all classifiers 
are applied and the winner classifier, with the largest 
confidence for one of the classes, determines the clas- 
sification. Another less commonly known method is 
the so called 'all-pairs' (or 'one-vs-one') decomposi- 
tion proposed by (?). In this method we train one 
binary classifier for each pair of classes. To classify a 
new instance we run a majority vote among all binary 
classifiers. The nice property of the "all-pairs" method 
is that it generates the easiest and most natural bi- 
nary problems of all known methods. The weakness 
of this method is that there may be irrelevant binary 
classifiers which participate in the vote. A number
of papers provide evidence that 'all-pairs' decompositions
are powerful and efficient and, in particular, that they
outperform the 'one-vs-rest' method; see e.g. (?).
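The all-pairs vote described above can be sketched in a few lines of Python. This is only an illustration: the `pairwise` dictionary of callables, the toy one-dimensional "nearest centre" classifiers and the tie-breaking rule are assumptions made for the example, not part of any method cited here.

```python
from collections import Counter
from itertools import combinations


def all_pairs_predict(x, classes, pairwise):
    """Majority vote over one binary classifier per unordered class pair.

    `pairwise[(a, b)]` is a hypothetical callable that returns a or b.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    # Ties are broken by the order of `classes` (an arbitrary choice).
    return max(classes, key=lambda c: votes[c])


# Toy 1-D problem: every class has a centre, and each pairwise classifier
# picks the nearer of its two centres.
centres = {"A": 0.0, "B": 5.0, "C": 10.0}
classes = sorted(centres)
pairwise = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centres[a]) <= abs(x - centres[b]) else b)
    for a, b in combinations(classes, 2)
}

prediction = all_pairs_predict(4.2, classes, pairwise)
```

In this toy run the (A, C) classifier still casts a vote although the instance lies near B, which is the "irrelevant classifier" effect mentioned above.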

For the most part, known decomposition methods, including
all those mentioned above, are "flat". In this paper we
focus on hierarchical decompositions. The incentive to
decompose a multiclass problem as a hierarchy is natural
and can have, at the outset, general advantages which are
both statistical and computational. Considering a
multiclass problem with k classes, the idea is to learn a
full binary tree¹ of classes, where each node is associated
with a subset of the k classes as follows: Each of the k
leaves is associated with a distinct class, and each
internal node is associated with the union of the class
subsets of its right and left children. Each such tree
defines a hierarchical partition of the set of classes and
the idea is to train a binary classifier for each internal
node so as to discriminate between the class subset of the
right child and the class subset of the left child. Note
that in a full binary tree with k leaves there are k - 1
internal nodes.

¹ In a full binary tree each node is either a leaf or has
two children.

Once these tree classifiers are trained, the classification
or "decoding" of a new instance can be done using
various approaches. One natural decoding method
would be to use the tree in a decision-tree fashion: 
Start with the binary classifier at the root and let this 
classifier determine either its right or left child, and 
this way follow a path to a leaf and assign the class 
associated with this leaf. This approach is particularly 
convenient when using hard binary classifiers giving
labels in {±1}. When using "soft" (confidence-rated)
and in particular probabilistic classifiers, giving confi- 
dence rates in [0, 1], a natural decoding method would 
be to calculate an estimate for the probability of fol- 
lowing the path from the root to each leaf and then 
use a "winner-takes-all" approach, which selects the 
path with the highest probability. 

Besides computational efficiency, the success of any 
multiclass decomposition scheme depends on (at least) 
two interrelated factors. The first factor is the statisti- 
cal "hardness" of each of the individual binary classi- 
fication problems. The second factor is the statistical 
robustness of the aggregation (or "decoding" ) method. 
The most fundamental measure for the hardness of a 
classification problem is its Bayes error. We attempt 
to use the Bayes error of the resulting decomposition 
and aim to hierarchically decompose the multiclass 
problem so as to construct a statistically "easy"
collection of binary problems.

Determining the Bayes error of a classification problem
based on the data (and without knowledge of the
underlying distributions) is a hard problem when no
restrictions are imposed (?). In this paper we restrict
ourselves to settings where the underlying distributions
can be faithfully modelled as multinomials. Potential
application areas are classification of natural language,
biological sequences, etc. We can therefore in principle
conveniently rely on studies which offer efficient and
reliable density estimation for multinomials (?; ?; ?;
?). As a first approximation, throughout this paper
we make the assumption that we hold "ideal" data
samples and simply rely on maximum likelihood
estimators that count occurrences.



But even if the underlying distributions are known, a
faithful estimation of the Bayes error is computationally
difficult. We rely on known information theoretic
bounds on the Bayes error, which can be efficiently
computed. In particular, we use Bayes error bounds
in terms of the Jensen-Shannon divergence (?) and we
derive upper and lower bounds on the inherent
classification difficulty of hierarchical multiclass
decompositions. Our bounds, which are tight in the worst
case, can be used as optimality measures for such
decompositions. Unfortunately, the translation of our
bounds into provably efficient algorithms to search for
high quality decompositions appears at the moment
computationally difficult. Therefore, we use a simple and
efficient greedy heuristic, which is able to generate
reasonable decompositions.

We provide an initial empirical evaluation of our methods
and test them on multiclass problems of varying
sizes in the application area of 'authorship
determination'. Our hierarchical decompositions
consistently improve on the 'all-pairs' method when the
number of classes is small but do not outperform
all-pairs with a larger number of classes. The authorship
determination set of problems we consider is taken from a
new benchmark collection consisting of machine learning
authors. The absolute accuracy results we obtain are
particularly high compared to standard results in this
area.

2. Preliminaries: Bounds on the Bayes 
Error and the Jensen-Shannon 
Divergence 

Consider a standard binary classification problem of
classifying an observation given by the random variable
$X$ into one of two classes $C_1$ and $C_2$. Let $\pi_1$
and $\pi_2$ denote the priors on these two classes,
$\pi_1 + \pi_2 = 1$ with $\pi_i > 0$. Let
$p_i(x) = p(X = x \mid C_i)$, $i = 1, 2$, be the
class-conditional probabilities. If $X = x$ is observed
then by Bayes rule the posterior probability of $C_i$ is

$$p(C_i \mid x) = \frac{\pi_i p_i(x)}{\pi_1 p_1(x) + \pi_2 p_2(x)}.$$

If all probabilities are known we can achieve the Bayes
error by choosing the class with the larger posterior
probability. Thus, the smallest error probability is

$$p(\mathrm{error} \mid x) = \frac{\min\{\pi_1 p_1(x), \pi_2 p_2(x)\}}{\pi_1 p_1(x) + \pi_2 p_2(x)},$$

and the Bayes error is given by

$$p_{\mathrm{Bayes}} = p(\mathrm{error}) = \int_{\mathcal{X}} p(x)\, p(\mathrm{error} \mid x)\, dx = \int_{\mathcal{X}} \min\{\pi_1 p_1(x), \pi_2 p_2(x)\}\, dx.$$
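For a finite observation set the integral becomes a sum, and the Bayes error can be computed directly. The sketch below assumes discrete class-conditionals given as lists of probabilities; the function name and the toy numbers are illustrative only.

```python
def bayes_error(priors, conditionals):
    """Bayes error of a discrete problem.

    For each outcome x the optimal rule picks the class with the largest
    joint probability pi_i * p_i(x); the remaining mass is the error.
    With two classes this is the sum over x of min(pi_1 p_1, pi_2 p_2).
    """
    n_outcomes = len(conditionals[0])
    error = 0.0
    for x in range(n_outcomes):
        joint = [pi * p[x] for pi, p in zip(priors, conditionals)]
        error += sum(joint) - max(joint)
    return error


# Two classes over three outcomes, with equal priors.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
err = bayes_error([0.5, 0.5], [p1, p2])
```

With these numbers the per-outcome minima are 0.05, 0.10 and 0.05, so the Bayes error is 0.20.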

The Bayes error quantifies the inherent difficulty of the
classification problem at hand (given the entire
probabilistic characterization of the problem) without any
considerations of inductive approximation based on
finite samples. In this paper we attempt to decompose
multi-class problems into hierarchically ordered
collections of binary problems so as to minimize the Bayes
error of the entire construction.

2.1. The Jensen-Shannon (JS) Divergence 

Let $P_1$ and $P_2$ be two distributions over some finite
set $\mathcal{X}$, and let $\pi = (\pi_1, \pi_2)$ be their
priors. Then, the Jensen-Shannon (JS) divergence (?) of
$P_1$ and $P_2$ with respect to the prior $\pi$ is

$$JS_\pi(P_1, P_2) = H(\pi_1 P_1 + \pi_2 P_2) - \pi_1 H(P_1) - \pi_2 H(P_2), \tag{1}$$

where $H(\cdot)$ is the Shannon entropy. It can be shown
that $JS_\pi(P_1, P_2)$ is non-negative, symmetric, bounded
(by $H(\pi)$) and it equals zero if and only if
$P_1 = P_2$. According to (?) the JS-divergence was first
introduced by (?) as a dissimilarity measure for random
graphs. Setting $M_\pi = \pi_1 P_1 + \pi_2 P_2$ it is easy
to see (?) that

$$JS_\pi(P_1, P_2) = \pi_1 D_{KL}(P_1 \| M_\pi) + \pi_2 D_{KL}(P_2 \| M_\pi), \tag{2}$$

where $D_{KL}(\cdot \| \cdot)$ is the Kullback-Leibler
divergence (?). The average distribution $M_\pi$ is called
the mutual source of $P_1$ and $P_2$ (?) and it can be
easily shown that

$$M_\pi = \arg\min_{Q} \; \pi_1 D_{KL}(P_1 \| Q) + \pi_2 D_{KL}(P_2 \| Q). \tag{3}$$

That is, the mutual source of $P_1$ and $P_2$ is the
closest to both of them simultaneously in terms of the
KL-divergence. Like the KL-divergence, the JS-divergence
has a number of important roles in statistics and pattern
recognition. In particular, the JS-divergence, compared
against a threshold, is an optimal statistical test in the
Neyman-Pearson sense (?) for the two-sample problem (?).
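The two expressions (1) and (2) can be checked against each other numerically. The sketch below is a minimal illustration for distributions given as lists of probabilities; measuring entropy in bits (base-2 logarithms) is an assumption of the example, as are all function names.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def mixture(priors, dists):
    """The mutual source M = sum_i pi_i P_i."""
    return [sum(w * d[x] for w, d in zip(priors, dists))
            for x in range(len(dists[0]))]


def js_entropy_form(priors, dists):
    """Equation (1): H(M) - sum_i pi_i H(P_i)."""
    m = mixture(priors, dists)
    return entropy(m) - sum(w * entropy(d) for w, d in zip(priors, dists))


def js_kl_form(priors, dists):
    """Equation (2): sum_i pi_i D_KL(P_i || M)."""
    m = mixture(priors, dists)
    return sum(w * kl(d, m) for w, d in zip(priors, dists))


priors = [0.5, 0.5]
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
js_1 = js_entropy_form(priors, [p1, p2])
js_2 = js_kl_form(priors, [p1, p2])
```

Both forms agree up to floating point error, and the value lies between 0 and $H(\pi)$, as stated above.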

2.2. Jensen-Shannon Bounds on the Bayes 
Error 

Lower and upper bounds on the binary Bayes error
are given by (?). Again, let $\pi = (\pi_1, \pi_2)$ be the
priors and $p_1, p_2$ the class conditionals, as defined
above. Let $P(\mathrm{error})$ be the Bayes error. Set
$J = H(\pi) - JS_\pi(p_1, p_2)$ with $H(\pi)$ denoting the
binary entropy.

Theorem 1 (Lin)

$$\frac{1}{4} J^2 \le P(\mathrm{error}) \le \frac{1}{2} J \tag{4}$$

These bounds are generalized to k classes in a
straightforward manner. Considering a multiclass problem
with $k$ classes and class-conditionals $p_1, \ldots, p_k$
and priors $\pi = (\pi_1, \ldots, \pi_k)$, the Bayes error
is given by

$$p(\mathrm{error}_k) = \int_{\mathcal{X}} p(x) \left(1 - \max\{p(C_1 \mid x), \ldots, p(C_k \mid x)\}\right) dx.$$

Now setting $J_k = H(\pi) - JS_\pi(p_1, \ldots, p_k)$ we have

Theorem 2 (Lin)

$$\frac{1}{4(k-1)} J_k^2 \le P(\mathrm{error}_k) \le \frac{1}{2} J_k \tag{5}$$

Given the true class-conditionals, these JS bounds on
the Bayes error can be efficiently computed using
either (1) or (2) (or their generalized forms).
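As a sanity check, the bounds (4) and (5) can be evaluated on a small discrete example and compared with the exact Bayes error. The sketch below assumes base-2 logarithms and discrete class-conditionals; the helper names and the numbers are illustrative.

```python
import math


def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)


def js(priors, dists):
    """Generalized JS divergence: H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    m = [sum(w * d[x] for w, d in zip(priors, dists))
         for x in range(len(dists[0]))]
    return entropy(m) - sum(w * entropy(d) for w, d in zip(priors, dists))


def bayes_error(priors, dists):
    """Exact Bayes error for discrete class-conditionals."""
    total = 0.0
    for x in range(len(dists[0])):
        joint = [w * d[x] for w, d in zip(priors, dists)]
        total += sum(joint) - max(joint)
    return total


def lin_bounds(priors, dists):
    """Return (lower, upper) = (J^2 / (4 (k - 1)), J / 2), J = H(pi) - JS."""
    k = len(priors)
    j = entropy(priors) - js(priors, dists)
    return j * j / (4 * (k - 1)), j / 2


priors = [0.5, 0.3, 0.2]
dists = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2]]
lower, upper = lin_bounds(priors, dists)
err = bayes_error(priors, dists)
```

For $k = 2$ the same function reduces to the binary bounds of Theorem 1.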

3. Bounds on the Bayes Error of 
Hierarchical Decompositions 

In this section we provide bounds on the Bayes er- 
ror of hierarchical decompositions. The bounds are 
obtained using a straightforward application of the
binary bounds of Theorem 1. We begin with a more
formal description of hierarchical decompositions. 

Consider a multi-class problem with $k$ classes
$C = \{C_1, \ldots, C_k\}$, and let $T = (V, E)$ be any full
binary tree with $k$ leaves, one for each class. To each
node $v \in V$ we assign a label set $\ell(v) \subseteq C$,
which is defined as follows. Each leaf $v$ (of the $k$
leaves) is mapped to a unique class (among the $k$
classes). If $v$ is an internal node whose left and right
children are $v_L$ and $v_R$, respectively, then
$\ell(v) = \ell(v_L) \cup \ell(v_R)$. Given the tree $T$
and the mapping $\ell$ we decompose the multi-class
problem by constructing a binary classifier $h_v$ for each
internal node $v$ of $T$ such that $h_v$ is trained to
discriminate between classes in $\ell(v_L)$ and classes in
$\ell(v_R)$. In the case of hard classifiers,
$h_v(x) \in \{\pm 1\}$ and we identify '-1' with 'L' and
'+1' with 'R'. In the case of soft classifiers,
$h_v(x) \in [0, 1]$ and we identify 0 with 'L' and 1 with
'R'. Since there are $k$ leaves there are exactly $k - 1$
binary classifiers in the tree. The training set of each
classifier is naturally determined by the mapping $\ell$.

Given a sample $x$ whose label (in $C$) is unknown, one
can think of a number of "decoding" schemes that
combine the individual binary classifiers. When
considering hard binary classifiers a natural choice to
aggregate the binary decisions is to start from the root
$r$ and apply its associated classifier $h_r$. If
$h_r(x) = -1$ we go to $r_L$ and otherwise we go to $r_R$,
etc. This way we continue until we reach a leaf and
predict for $x$ this leaf's associated (unique) class. In
the case of soft binary classifiers a natural decoding
would be to consider for each leaf $v$ the path from the
root to $v$, and multiply the probability estimates along
this path. Then the leaf with the largest probability will
assign a label to $x$.
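The two decoding schemes can be sketched as follows. The `Node` class, the convention that a soft classifier returns the probability of going right, and the constant toy classifiers are all assumptions made for this illustration.

```python
class Node:
    """A tree node: leaves carry a class label, internal nodes a classifier."""

    def __init__(self, label=None, left=None, right=None, clf=None):
        self.label, self.left, self.right, self.clf = label, left, right, clf

    @property
    def is_leaf(self):
        return self.left is None


def decode_hard(root, x):
    """Follow hard decisions (-1 = left, +1 = right) from the root to a leaf."""
    node = root
    while not node.is_leaf:
        node = node.left if node.clf(x) == -1 else node.right
    return node.label


def decode_soft(root, x):
    """Return the leaf with the largest root-to-leaf path probability.

    Here clf(x) in [0, 1] is read as the probability of going right.
    """
    best_label, best_prob = None, -1.0
    stack = [(root, 1.0)]
    while stack:
        node, prob = stack.pop()
        if node.is_leaf:
            if prob > best_prob:
                best_label, best_prob = node.label, prob
            continue
        p_right = node.clf(x)
        stack.append((node.left, prob * (1.0 - p_right)))
        stack.append((node.right, prob * p_right))
    return best_label, best_prob


# Tree ((A, B), C) with constant toy classifiers.
soft_root = Node(
    left=Node(left=Node("A"), right=Node("B"), clf=lambda x: 0.5),
    right=Node("C"),
    clf=lambda x: 0.45,
)
hard_root = Node(
    left=Node(left=Node("A"), right=Node("B"), clf=lambda x: +1),
    right=Node("C"),
    clf=lambda x: -1,
)
soft_label, soft_prob = decode_soft(soft_root, x=None)
hard_label = decode_hard(hard_root, x=None)
```

The toy numbers are chosen so that the two schemes disagree: the soft path to C has probability 0.45, while each of A and B only reaches 0.275, although the root leans slightly to the left.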

There is a huge number of possible hierarchical decom- 
positions already for moderate values of k. We note 
that a known decomposition scheme which is captured 
by such hierarchical constructions is the decision list 
multiclass decomposition approach (referred to as "or- 
dered one-against-all class binarization" in (?)). 

Consider a $k$-way multiclass problem with class
conditionals $p_i(x) = P(x \mid C_i)$ and priors
$\pi_1, \ldots, \pi_k$. Suppose we are given a
decomposition structure $(T, \ell)$ for $k$ classes
consisting of the tree $T$ and the class mapping $\ell$.
Each internal node $v$ of $T$ corresponds to one binary
classification problem. The original multiclass problem
naturally induces class conditional probabilities and
priors for the binary problem at $v$ and we denote these
conditionals by $p_v(x \mid v_L)$ and $p_v(x \mid v_R)$ and
the prior by $\pi(v)$. For example, denoting the root of
$T$ by $r$, we have

$$p_r(r_L \mid x) = \sum_{C_i \in \ell(r_L)} p(C_i \mid x),$$

with $p_r(x \mid r_L) = p_r(r_L \mid x)\, p(x) / \pi(r_L)$ by
Bayes rule and $\pi(r_L) = \sum_{C_i \in \ell(r_L)} \pi_i$.
Let $p_v(\mathrm{error})$ be the Bayes error of this
problem and denote the Bayes error of the entire tree by
$p_T(\mathrm{error})$.

Proposition 3 For each internal node $v$ of $T$ let
$q(v) = 1 - \frac{1}{2} J(v)$ where

$$J(v) = H[\pi(v)] - JS_{\pi(v)}[p_v(x \mid v_L), p_v(x \mid v_R)].$$

Then

$$p_T(\mathrm{error}) \le 1 - Q(T),$$

where

$$Q(T) = q(r) \left[Q(T_L) + Q(T_R)\right] \tag{6}$$

and for a leaf $v$, $Q(v) = 1$.

Proof For each class $j$, $j = 1, \ldots, k$, let
$v_1^j, v_2^j, \ldots, v_{n_j}^j$ be the path from the root
to the leaf corresponding to class $j$, where $v_1^j$ is
the root of $T$ and $v_{n_j}^j$ is the leaf. This path
consists of $n_j - 1$ binary problems. The probability of
following this path and reaching the leaf $v_{n_j}^j$ is

$$P[\text{reaching } v_{n_j}^j] = \prod_{i=1}^{n_j - 1} \left(1 - p_{v_i^j}(\mathrm{error})\right).$$



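Thus, the overall average error probability
$P_T(\mathrm{error})$ for the entire structure $(T, \ell)$ is

$$P_T(\mathrm{error}) = 1 - \sum_{j=1}^{k} P[\text{reaching } v_{n_j}^j] = 1 - \sum_{j=1}^{k} \prod_{i=1}^{n_j - 1} \left(1 - p_{v_i^j}(\mathrm{error})\right).$$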

Using the JS (upper) bound from Equation (4) on the
individual binary problems in $T$ we have

$$P_T(\mathrm{error}) \le 1 - \sum_{j=1}^{k} \prod_{i=1}^{n_j - 1} \left(1 - \tfrac{1}{2} J(v_i^j)\right), \tag{7}$$

where for $v = v_i^j$,
$J(v) = H(\pi(v)) - JS_{\pi(v)}(p_v(x \mid v_L), p_v(x \mid v_R))$.
Rearranging terms it is not hard to see that

$$Q(T) = \sum_{j=1}^{k} \prod_{i=1}^{n_j - 1} \left(1 - \tfrac{1}{2} J(v_i^j)\right).$$

■

The same derivation, now using the JS lower bound of
Equation (4), yields:

Proposition 4 For each internal node $v$ of $T$ let
$q'(v) = 1 - \frac{1}{4} J'(v)$ where

$$J'(v) = \left(H[\pi(v)] - JS_{\pi(v)}[p_v(x \mid v_L), p_v(x \mid v_R)]\right)^2.$$

Then

$$P_T(\mathrm{error}) \ge 1 - Q'(T),$$

where

$$Q'(T) = q'(r) \left[Q'(T_L) + Q'(T_R)\right]$$

and for a leaf $v$, $Q'(v) = 1$.
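The recurrences of Propositions 3 and 4 are easy to evaluate once $J(v)$ is known at every internal node. The sketch below assumes a tree encoded as nested tuples `(J, left, right)` with plain class labels as leaves; this encoding and the function names are illustrative choices.

```python
def q_bound(tree, node_q):
    """Evaluate Q(T) = q(r) * (Q(T_L) + Q(T_R)), with Q(leaf) = 1.

    `tree` is either a class label (leaf) or a tuple (J, left, right),
    where J is the value J(v) at that internal node; `node_q` maps J
    to the per-node factor q(v).
    """
    if not isinstance(tree, tuple):
        return 1.0
    j, left, right = tree
    return node_q(j) * (q_bound(left, node_q) + q_bound(right, node_q))


def upper_q(j):
    """q(v) = 1 - J(v) / 2, as in Proposition 3."""
    return 1.0 - j / 2.0


def lower_q(j):
    """q'(v) = 1 - J'(v) / 4 with J'(v) = J(v)^2, as in Proposition 4."""
    return 1.0 - j * j / 4.0


# Tree ((A, B), C) with J = 0.2 at the root and J = 0.1 at the (A, B) node.
tree = (0.2, (0.1, "A", "B"), "C")
q_upper = q_bound(tree, upper_q)   # 0.9 * (0.95 * 2 + 1)
q_lower = q_bound(tree, lower_q)   # 0.99 * (0.9975 * 2 + 1)
```

Note that, as stated, $Q(T)$ adds one product per leaf, so its value can exceed 1; the sketch uses it only as a score for comparing trees.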

4. A Heuristic Procedure for 

Agglomerative Tree Constructions 

The recurrences of Propositions 3 and 4 provide the
means for efficient calculations of upper and lower
bounds on the multiclass Bayes error of any tree
decomposition given the class conditional probabilities
of the leaves. Our goal is to construct a full binary
tree $T$ whose Bayes error is minimal. A natural approach
would be to consider trees whose Bayes error upper
bound is minimal. This corresponds to maximizing
$Q(T)$ (Equation (6)) over all trees $T$. There are two
obstacles for achieving this goal. The statistical
obstacle is that the true class conditional distributions
of internal nodes are not available to us. The
computational obstacle is that the number of possible
trees is huge.² Handling the first obstacle in the general
case using density estimation techniques appears to be
counterproductive as density estimation is considered
harder than classification. But we can restrict ourselves
to parametric models such as multinomials where estimation
of the class conditional probabilities can be achieved
reliably and efficiently; see e.g. (?; ?; ?; ?). In the
present work we ignore the discrepancy that will appear in
our Bayes error bounds (even in the case of multinomials)
and rely on simple maximum likelihood estimates of
the class-conditionals.

² The number of unlabeled full binary trees with $k$ leaves
is the Catalan number
$C_{k-1} = \frac{1}{k}\binom{2k-2}{k-1}$. The number of
labeled trees (not counting isomorphic trees) is
$O(2^k k!)$.
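A maximum likelihood estimate of a multinomial class-conditional is simply a vector of relative frequencies. The sketch below assumes each class is represented by a list of tokens and a fixed vocabulary; the function name is hypothetical, and no smoothing is applied, so unseen tokens get probability zero.

```python
from collections import Counter


def ml_multinomial(tokens, vocabulary):
    """Relative frequency of every vocabulary token in `tokens`.

    Tokens outside the vocabulary are ignored.
    """
    vocab = set(vocabulary)
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no vocabulary token occurs in the sample")
    return {t: counts[t] / total for t in vocabulary}


sample = "the of the and the of".split()
estimate = ml_multinomial(sample, ["the", "of", "and", "to"])
```

Here 'the' occurs in three of the six tokens, so its estimated probability is 0.5, while the unseen token 'to' is assigned 0.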

To handle the maximization of $Q(T)$ we use the following
agglomerative randomized heuristic procedure.
We start with a forest of all $k$ leaves, corresponding
to the $k$ classes. Our estimates for the priors
of these classes, $\pi_j$, $j = 1, \ldots, k$, are obtained
from the data. We perform $k - 1$ agglomerative mergers
as follows. On step $i$, $i = 1, \ldots, k - 1$, we
have a forest $F_i$ containing $N_i = k - i + 1$ trees,
$T_1, \ldots, T_{N_i}$. Each of these trees $T$ has an
associated class-conditional probability $P_T(x)$ (which
is again estimated from the data), and a weight $w(T)$
that equals the sum of priors of its leaves. For each pair
of trees $T_i$ and $T_j$ we compute their JS-divergence
$JS(i,j) = JS_{\pi(i,j)}(P_{T_i}(x), P_{T_j}(x))$, where
$\pi(i,j) = \left(\frac{w(T_i)}{w(T_i) + w(T_j)}, \frac{w(T_j)}{w(T_i) + w(T_j)}\right)$.
For each possible merger (between $i$ and $j$) we assign
the probability $p(i,j)$ proportional to $2^{-JS(i,j)}$.
This way mergers with large JS values are assigned smaller
probabilities and vice versa.³ We then randomly choose one
merger according to these probabilities. The newly merged
tree $T_{ij}$ is assigned the mutual source of $T_i$ and
$T_j$ as its class-conditional (see Equation (3)) and its
weight is $w(T_i) + w(T_j)$. In all the experiments
described below, to obtain a multiclass decomposition we
ran this randomized procedure 10 times and chose the tree
$T$ that maximized $Q(T)$. The chosen tree $T$ then
determines the hierarchical decomposition, as described in
Section 3. Note that the above procedure does not directly
maximize $Q(T)$. The routine simply attempts to find trees
whose higher internal nodes are "well-separated". Such
trees will have low Bayes error and our formal indication
for that will be that $Q(T)$ will be large. Thus,
currently we can only use our bounds as a means to verify
that a hierarchical decomposition is good, or to compare
between two decompositions.

³ Using a Bayesian argument it can be shown (?) that
if $X$ and $Y$ are samples with types (empirical
probability) $P_{T_i}$ and $P_{T_j}$, respectively, then
$2^{-JS(i,j)}$ is proportional to the probability that $X$
and $Y$ emerged from the same distribution.
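The randomized agglomerative procedure can be sketched as below. This is an illustration under stated assumptions rather than the implementation used in the experiments: class-conditionals are plain probability lists, entropies are in bits, $J(v)$ at a merged node is computed from the normalized weights $\pi(i,j)$, trees are nested tuples `(J, left, right)`, and all names are hypothetical.

```python
import math
import random
from itertools import combinations


def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)


def js_and_mixture(weights, dists):
    """Return (JS divergence, mutual source) for the given weights."""
    m = [sum(w * d[x] for w, d in zip(weights, dists))
         for x in range(len(dists[0]))]
    js = entropy(m) - sum(w * entropy(d) for w, d in zip(weights, dists))
    return js, m


def build_tree(priors, conditionals, rng):
    """One randomized run: k - 1 mergers, starting from a forest of leaves."""
    # Each forest entry is (tree structure, class-conditional, weight).
    forest = [(c, list(conditionals[c]), priors[c]) for c in priors]
    while len(forest) > 1:
        pairs = list(combinations(range(len(forest)), 2))
        scored = []
        for i, j in pairs:
            wi, wj = forest[i][2], forest[j][2]
            pi = (wi / (wi + wj), wj / (wi + wj))
            js, m = js_and_mixture(pi, (forest[i][1], forest[j][1]))
            scored.append((js, m, entropy(pi) - js))
        # Merger probability proportional to 2^(-JS(i, j)).
        weights = [2.0 ** -s[0] for s in scored]
        idx = rng.choices(range(len(pairs)), weights=weights)[0]
        i, j = pairs[idx]
        _, m, node_j = scored[idx]
        merged = ((node_j, forest[i][0], forest[j][0]),
                  m, forest[i][2] + forest[j][2])
        forest = [t for n, t in enumerate(forest) if n not in (i, j)]
        forest.append(merged)
    return forest[0][0]


def q_upper(tree):
    """Q(T) of Proposition 3 for a nested (J, left, right) tree."""
    if not isinstance(tree, tuple):
        return 1.0
    j, left, right = tree
    return (1.0 - j / 2.0) * (q_upper(left) + q_upper(right))


def best_of(priors, conditionals, runs=10, seed=0):
    """Repeat the randomized procedure and keep the tree with the largest Q(T)."""
    rng = random.Random(seed)
    trees = [build_tree(priors, conditionals, rng) for _ in range(runs)]
    return max(trees, key=q_upper)


priors = {"a": 0.4, "b": 0.3, "c": 0.3}
conds = {"a": [0.7, 0.2, 0.1], "b": [0.6, 0.3, 0.1], "c": [0.1, 0.2, 0.7]}
tree = best_of(priors, conds)
```

Class labels must not themselves be tuples in this encoding, since tuples mark internal nodes. The fixed `seed` only makes the sketch reproducible.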



5. The Machine Learning Authors 
Dataset 

In our experiments (Section 6) we used a new benchmark
dataset for testing authorship determination algorithms.
This dataset contains a collection of singly-authored
scientific research papers. The scientific affiliation of
all authors is machine learning, statistical pattern
recognition and related application areas.
After this dataset was automatically collected from
the web using a focused crawler guided by a compiled
list of machine learning researchers, it was manually
checked to see that all papers are indeed by single
authors. This Machine Learning Authors (MLA)
dataset contains articles by more than 400 authors
with each author having at least one singly-authored
paper.⁴ For the present study we extracted from the
MLA collection a subset that was prepared as follows.
The raw papers (given in either PS or PDF formats)
were first translated to ASCII and then each paper was
parsed into tokens. A token is either a word (a
sequence of alphanumeric characters ending with one of
the space characters or a punctuation mark) or a
punctuation symbol.⁵ To enhance uniformity and
experimental control we segmented each paper into chunks,
which we call paragraphs, where a paragraph contains 1000
tokens.⁶ To eliminate topical information we projected all
documents on the most frequent 5000 tokens. Appearing
among these tokens are almost all of the most frequent
function words in English, which bear no topical
content but are known to provide highly discriminative
information for authorship determination (?; ?).
For example, in Figure 1 we see a projected excerpt
from the paper (?) as well as its source containing
all the tokens. Clearly there are non-function words
(like 'data') which remained in the projected excerpt.
Nevertheless, since all the authors in the dataset write
about machine learning related issues, such words do
not contain much topical content.
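The preprocessing steps above (tokenizing, cutting into 1000-token paragraphs and projecting onto the most frequent tokens) can be sketched as follows. The regular expression, which treats every non-alphanumeric, non-space character as a punctuation token, is a simplifying assumption, as are the function names; the rule for short final paragraphs follows the footnote on paragraph lengths.

```python
import re
from collections import Counter

# A word is a run of alphanumeric characters; any other non-space
# character is taken as a one-character punctuation token (an assumption).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def chunk(tokens, size=1000, min_tail=500):
    """Cut into paragraphs of `size` tokens; merge a short last one backwards."""
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks


def top_tokens(documents, n=5000):
    """The n most frequent tokens over all (tokenized) documents."""
    counts = Counter(t for doc in documents for t in doc)
    return {t for t, _ in counts.most_common(n)}


def project(tokens, vocabulary):
    """Keep only tokens that belong to the high frequency vocabulary."""
    return [t for t in tokens if t in vocabulary]


tokens = tokenize("Over the past decade, many organizations have begun.")
sizes = [len(c) for c in chunk(list(range(2300)))]
```

With the default parameters a 2300-token paper yields paragraphs of 1000 and 1300 tokens, since the 300-token remainder is merged into the preceding paragraph.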

We selected from MLA only the authors who have 
more than 30 paragraphs in the dataset. The result 
is a set of exactly 100 authors and in the rest of the 
paper we call the resulting set the MLA-100 dataset. 

⁴ The MLA dataset will soon be publicly available at
http://www.cs.technion.ac.il/~rani/authorshipl

⁵ We considered as tokens the following punctuation marks:

■;,:?!'0"-/V 

⁶ Last paragraphs of fewer than 500 tokens were combined
with second-last paragraphs. This way, paragraph lengths
vary in [500, 1499] but a large majority of the paragraphs
are of exactly 1000 tokens.



Projected Text 

Over the many have to of data their „their „and their 
..At the same time,„and in many nd complex „such 
as the of data that in .. The of data the of how best 
to use this data to general and to ..Data ::using data 
to and ..The of in data follows from the of several : 

Original Text 

Over the past decade many organizations have be- 
gun to routinely capture huge volumes of historical 
data describing their operations, their products, and 
their customers. At the same time, scientists and 
engineers in many fields find themselves capturing 
increasingly complex experimental datasets, such as 
the gigabytes of functional MRI data that describe 
brain activity in humans. The field of data mining 
addresses the question of how best to use this histor- 
ical data to discover general regularities and to im- 
prove future decisions. Data Mining: using historical 
data to discover regularities and improve future de- 
cisions. The rapid growth of interest in data mining 
follows from the confluence of several recent trends: 

Figure 1. An excerpt from the paper "Machine Learning
and Data Mining" (?). Top: A projection of the text over
the high frequency tokens; Bottom: The original text.



6. Experiments 

Here we describe our initial empirical studies of the
proposed multiclass decomposition procedure. We
compare our method with the "all-pairs" decomposition.
Taking the MLA-100 dataset (see Section 5) we
generated a progressively increasing sequence of random
subsets as follows. From the MLA-100 we randomly chose
3 authors, then added another author, chosen randomly
and uniformly from the remaining authors, etc.
This way we generated increasing sets of authors in
the range of 3-100. So far we have experimented with
multiclass subsets with k = 3, ..., 20, 50 and 100. In all
the experiments we used an SVM with an RBF kernel.
The SVM parameters were chosen using cross-validation.
The reported results are averages of 3-fold
cross-validation.
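One simple way to obtain such nested author sets is to take prefixes of a single random permutation, which is equivalent to choosing 3 authors uniformly at random and then repeatedly adding a uniformly chosen remaining author. The sketch below is illustrative; the function name, the seed and the placeholder author names are assumptions.

```python
import random


def nested_author_subsets(authors, start=3, seed=0):
    """Map each size k >= start to a subset; each subset extends the previous one."""
    rng = random.Random(seed)
    order = list(authors)
    rng.shuffle(order)
    return {k: order[:k] for k in range(start, len(order) + 1)}


authors = [f"author_{i:03d}" for i in range(100)]   # placeholder names
subsets = nested_author_subsets(authors)
```

Each multiclass problem of size k then uses only the paragraphs of the authors in `subsets[k]`.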

The features generated for our authorship determination
problems contained in all cases the top 5000 single
tokens (see Section 5 for the token definition) as
well as the following "high order pairs". After
projecting the documents over the high frequency single
tokens we took all bigrams. For instance, considering
the projected text in Figure 1, the token pair 'to'+'of'
appearing in the first line of the projected text (top)
is one of our features. Notice that in the original text
this pair of words appears 5 words apart. This way our
representation captures high order pairwise statistics
of the tokens. Moreover, since we restrict ourselves
to the most frequent tokens in the text these pairs of
tokens do not suffer too much from the typical
statistical sparseness which is usually experienced when
considering n-grams in text categorization and language
models.
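The pair features can be illustrated with a few lines of Python. The sketch assumes raw occurrence counts as feature values and a tiny hand-picked vocabulary; both are assumptions of the example, and the function name is hypothetical.

```python
from collections import Counter


def pair_features(tokens, vocabulary):
    """Count single tokens and adjacent pairs in the projected token sequence."""
    projected = [t for t in tokens if t in vocabulary]
    unigrams = Counter(projected)
    bigrams = Counter(zip(projected, projected[1:]))
    return unigrams, bigrams


# A fragment of the original text of Figure 1 and a toy vocabulary.
original = "begun to routinely capture huge volumes of historical data".split()
unigrams, bigrams = pair_features(original, {"to", "of", "data"})
```

After projection the sequence is 'to', 'of', 'data', so the pair ('to', 'of') becomes a feature even though the two words are five positions apart in the original text.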

Accuracy results for both "all-pairs" and our hierarchical
decomposition procedure appear in Figure 2. The
first observation is that the absolute values of these
classification results are rather high compared to
typical figures reported in authorship determination. For
example, (?) report accuracy around 70% for
distinguishing between 10 authors of newspaper articles.
Such figures (i.e. number of authors and around 60%-80%
accuracy) appear to be common in this field. The
closest results in both size and accuracy we have found
are those of (?), who distinguish between 117 newsgroup
authors with accuracy 58.8% and between 84 authors
with accuracy 80.9%. Still, this is far from the 91%
accuracy we obtain for 50 authors and 88% accuracy
for 100 authors.

The consistent advantage of hierarchical decompositions
over all-pairs is evident for a small number of
classes. However, for over 10 classes, there is no
significant difference between the methods. Interestingly,
the best hierarchical constructs our method generated
(in terms of Q(T)) were completely skewed. It is
not clear to us at this stage whether this is an artifact 
of our Bayes error bound or a weakness of our heuristic 
procedure. 

7. Concluding Remarks 

This paper presents a new approach for hierarchical 
multiclass decomposition of multinomials. A similar 
hierarchical approach can be attempted with nonparametric
models. For instance, using any nonparametric
probabilistic binary discriminator one can attempt
to heuristically estimate the hardness of the involved 
binary problems using empirical error rates and design 
reasonable hierarchical decompositions. However, a 
major difficulty in this approach is the computational 
burden. 

When considering the main inherent deficiency of
all-pairs decompositions it appears that this deficiency
should disappear or at least soften when the number of
classes increases. The reason is that with a large number
of classes, the noisy votes of irrelevant classifiers will
tend to cancel out and the power of the relevant
classifiers will then increase. We therefore speculate that
it would be very hard to consistently beat all-pairs
decompositions with a very large number of classes.
Nevertheless, a desirable property of a decomposition
scheme is scalability, which allows for efficient handling
of a large number of classes (and datasets). For example,
one can hypothesize useful authorship determination
applications which need to distinguish between thousands
or even millions of authors. While balanced hierarchical
decompositions will be able to scale up to these
dimensions, the $O(k^2)$ complexity of the all-pairs
method would probably start to form a computational
bottleneck.

Figure 2. The performance of hierarchical multiclass
decompositions and 'all-pairs' decompositions on 20
authorship determination problems with varying number of
classes.